ABSTRACT

Many link prediction algorithms have been proposed and
applied to various real networks, but these works rarely
take the weights of links into account. In this paper, we
use local similarity indices, namely Common Neighbors,
the Adamic-Adar index, the Resource Allocation index, and
their weighted versions, to estimate the likelihood of the
existence of links in weighted networks. In both the
unweighted and weighted cases, the Resource Allocation index
performs the best. To our surprise, the weighted indices
perform worse than the unweighted ones, which reminds us of
the well-known Weak Ties Theory. Further experiments show
that weak ties play a significant role in the link prediction
problem, and that emphasizing their contribution can
remarkably enhance the prediction accuracy.

Categories and Subject Descriptors 

H.2.8 [Database Management]: Database Applications - Data mining;
H.3.3 [Information Storage and Retrieval]: Information Search
and Retrieval - Information Filtering

General Terms 

Algorithms, Experimentation. 
1. INTRODUCTION

Many complex systems can be well described by networks
where nodes represent individuals or agents, and links denote
the relations or interactions between nodes. Recently, link
prediction in complex networks has attracted increasing
attention from computer scientists [6] and physicists [3, 17].
Link prediction aims at estimating the likelihood of the
existence of a link between two nodes, based on the observed
links and the attributes of the nodes. For example, classical
information retrieval can be viewed as predicting missing
links between words and documents [18], and the process of
recommending items to a user can be considered as a link
prediction problem in the user-item bipartite network [22].

The problem of link prediction can be categorized into two
classes. One is the prediction of existing yet unknown links
in sampled networks, such as food webs, protein-protein
interaction networks and metabolic networks; the other is the
prediction of links that may appear in the future of evolving
networks, such as on-line social networks. In addition, link
prediction algorithms can be used to generate artificial
links that help further network analysis, such as the
classification problem in partially labeled networks [11] [5].
Some algorithms based on Markov chains [19] [23] [2] and
machine learning [16, 20] have been proposed recently, and
another group of algorithms is based on the definition of node
similarity. In this paper, we concentrate on the latter. Node
similarity can be defined by using the essential attributes of
nodes: two nodes are considered to be more similar
if they have many common features. However, the essential
features of nodes are usually not available, and thus the
mainstream similarity-based link prediction algorithms
consider only the observed network structure. Liben-Nowell
and Kleinberg [9] systematically compared several
structure-based node similarity indices for the link
prediction problem in co-authorship networks, and Zhou et al.
[21] studied nine well-known local similarity indices on six
real networks drawn from disparate fields, and proposed two
new local indices.

Up to now, most studies of link prediction have not taken the
weights of links into consideration. Murata et al. [12]
proposed three weighted similarity indices, as variants of the
Common Neighbors, Adamic-Adar and Preferential Attachment
indices, respectively. They applied these indices to the
networks of a Question-Answer Bulletin Board System, and the
results show that taking weights into account can enhance the
prediction accuracy. To our surprise, when
we apply the weighted indices to co-authorship networks and
the US air transportation network, we find that the weighted
indices perform even worse than the unweighted ones. In fact,
Liben-Nowell and Kleinberg [9] reported a similar
observation for the weighted Katz index. These unexpected
results remind us of the well-known Weak Ties Theory [7].
Further experiments show that weak ties play a significant
role in the link prediction problem, and that emphasizing
their contribution can remarkably enhance the
prediction accuracy.



Table 1: Algorithmic accuracy, measured by precision. Each number is obtained by averaging over 100
realizations with independent random divisions into training set and probe set. The numbers inside
the brackets denote the standard deviations. For example, 0.592(48) means the precision is 0.592 and the
standard deviation is 0.048. The abbreviations WCN*, WAA* and WRA* denote the highest precisions
obtained by Eqs. (7-9), respectively. The corresponding optimal values of $\alpha$ are shown in Table 2.
2. DATA AND METHOD

Consider an undirected simple network $G(V,E)$, where
$V$ is the set of nodes and $E$ is the set of links. Multiple
links and self-connections are not allowed. For each pair of
nodes, $x, y \in V$, we assign a score, $s_{xy}$, according to
a given similarity measure. A higher score means higher
similarity between the two nodes, and vice versa. Since $G$ is
undirected, the score is supposed to be symmetric, i.e.,
$s_{xy} = s_{yx}$. All nonexistent links are sorted in
decreasing order of their scores, and the links at the top are
the most likely to exist. To test the algorithmic accuracy, the
set of observed links, $E$, is randomly divided into two parts:
the training set, $E^T$, is treated as known information, while
the probe set, $E^P$, is used for testing, and no information
therein is allowed to be used for prediction. Clearly,
$E = E^T \cup E^P$ and $E^T \cap E^P = \emptyset$. In this
paper, the training set always contains 90% of the links, and
the remaining 10% constitute the probe set. To quantify the
prediction accuracy, we use a standard metric called precision,
which is defined as the ratio of relevant items selected to the
number of items selected. We focus on the top $L$ predicted
links (in this paper, we set $L = 100$); if $L_r$ of them are
relevant (i.e., they are links in the probe set), the precision
equals $L_r/L$. Clearly,
higher precision means higher prediction accuracy.
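
The evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `split_links` and `precision_at_L`, the representation of links as tuples, and the fixed random seed are all assumptions made for the example.

```python
import random

def split_links(links, probe_fraction=0.1, seed=0):
    """Randomly divide the observed links into a training set and a probe set.

    `links` is an iterable of hashable node pairs, e.g. (x, y) tuples.
    The names and the seed argument are illustrative assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(links)
    rng.shuffle(shuffled)
    n_probe = int(round(probe_fraction * len(shuffled)))
    probe = set(shuffled[:n_probe])
    train = set(shuffled[n_probe:])
    return train, probe

def precision_at_L(scores, probe, L=100):
    """Precision L_r / L over the top-L scored candidate links.

    `scores` maps each candidate pair (a link absent from the training
    set) to its similarity score; `probe` is the set of held-out links.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:L]
    hits = sum(1 for pair in ranked if pair in probe)  # this is L_r
    return hits / L
```

In practice the split would be repeated many times (the paper averages over 100 divisions) and the resulting precisions averaged.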

The empirical data used in this paper include: (i) USAir: the
US air transportation network, which contains 332 airports
and 2126 airlines (see Pajek Datasets). The weight of
a link is the frequency of flights between two airports. (ii)
NetScience: the co-authorship network of 1589 scientists
who themselves work on network science [14]. Here,
the weight between two scientists is not simply the number
of papers they co-authored. According to [13], if a paper
has $n$ coauthors, then the weight contributed by this paper
to each pair of its authors is $1/(n-1)$. For two scientists,
the final weight of their link is obtained by summing the
weights contributed by all their co-authored papers. (iii)
CGScience: the co-authorship network in computational
geometry until February 2002 (see Pajek Datasets). This
network contains 7343 authors and 11898 links. Two authors
are linked if they wrote at least one paper or book together.
The weight of a link is the number of their common papers
or books.
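
As an illustration of the NetScience weighting rule, the sketch below accumulates a contribution of 1/(n - 1) per paper for every pair of its n coauthors. The function name `coauthor_weights` and the input format (a list of author lists) are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def coauthor_weights(papers):
    """Return a dict mapping sorted author pairs to co-authorship weights.

    Each paper with n >= 2 distinct authors adds 1 / (n - 1) to the
    weight of every pair of its authors. Hypothetical helper.
    """
    weights = defaultdict(float)
    for authors in papers:
        authors = sorted(set(authors))
        n = len(authors)
        if n < 2:
            continue  # a single-author paper creates no link
        for a, b in combinations(authors, 2):
            weights[(a, b)] += 1.0 / (n - 1)
    return dict(weights)
```

For instance, two authors who share one two-author paper and one three-author paper obtain a weight of 1 + 1/2 = 1.5.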

3. UNWEIGHTED SIMILARITIES BASED 
ON LOCAL INFORMATION 

Among many similarity indices, Liben-Nowell and Kleinberg
[9] showed that the Common Neighbors (CN) and Adamic-Adar
(AA) [1] indices perform the best, which has been further
demonstrated by systematically comparing CN and AA
with seven other well-known local similarity indices [21].
In addition, Zhou et al. [21] proposed a new index named the
Resource Allocation (RA) index, which can beat both CN
and AA. Therefore, in this paper, we concentrate on the
CN, AA and RA indices, which are defined as follows.

(i) CN. Intuitively, two nodes, $x$ and $y$, are more
likely to form a link in the future if they have many common
neighbors. Let $\Gamma(x)$ denote the set of neighbors of node $x$.
The simplest measure of the neighborhood overlap is the
direct count:

$$s_{xy} = |\Gamma(x) \cap \Gamma(y)|, \tag{1}$$

where $|Q|$ is the cardinality of the set $Q$.
(ii) AA index. It refines the simple counting of common
neighbors by assigning more weight to the less-connected
neighbors:

$$s_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log k(z)}, \tag{2}$$

where $k(z)$ is the degree of node $z$, namely $k(z) = |\Gamma(z)|$.

(iii) RA index. Consider a pair of nodes, $x$ and $y$, which
are not directly connected. The node $x$ can send some
resource to $y$, with their common neighbors playing the role
of transmitters. In the simplest case, we assume that each
transmitter has a unit of resource and distributes it equally
to all its neighbors. As a result, the amount of resource $y$
receives is defined as the similarity between $x$ and $y$:

$$s_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k(z)}. \tag{3}$$
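
The three unweighted indices of Eqs. (1-3) can be sketched in a few lines. This is an illustrative implementation, assuming the network is given as a list of undirected links; the helper names `build_neighbors`, `cn`, `aa` and `ra` are not from the paper.

```python
import math

def build_neighbors(links):
    """Map each node to the set of its neighbors (undirected, simple)."""
    nbrs = {}
    for x, y in links:
        nbrs.setdefault(x, set()).add(y)
        nbrs.setdefault(y, set()).add(x)
    return nbrs

def cn(nbrs, x, y):
    """Common Neighbors, Eq. (1): number of shared neighbors."""
    return len(nbrs[x] & nbrs[y])

def aa(nbrs, x, y):
    """Adamic-Adar, Eq. (2). A common neighbor has degree >= 2, so log k(z) > 0."""
    return sum(1.0 / math.log(len(nbrs[z])) for z in nbrs[x] & nbrs[y])

def ra(nbrs, x, y):
    """Resource Allocation, Eq. (3): each common neighbor contributes 1 / k(z)."""
    return sum(1.0 / len(nbrs[z]) for z in nbrs[x] & nbrs[y])
```

The only difference between AA and RA is the penalty on high-degree common neighbors: logarithmic in AA, linear in RA.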



Empirical analysis [21] shows that, compared with CN and
AA, RA can enhance the prediction accuracy measured by
the area under the receiver operating characteristic curve (AUC)
[8], especially for networks with large average degrees
(in such cases, the difference between RA and AA is large).
AUC takes into account the whole ranking, while precision
concentrates only on the top $L$ predicted links. As shown in
Table 1, in terms of precision, RA still performs remarkably
better than CN and AA. This is a simple but significant
result: the RA index outperforms the CN and AA indices,
and thus may find applications in better characterizing the
proximity of nodes in networks.

4. WEIGHTED SIMILARITIES 

Figure 1: Precision as a function of $\alpha$ for USAir,
NetScience and CGScience: CN (triangles), AA
(circles) and RA (squares). The inset in the plot for
CGScience shows the precision of CN for $\alpha \in [-5, 1]$.
Each data point is obtained by averaging over 100
realizations, each of which corresponds to an
independent division into training set and probe set.

The similarity indices mentioned in the last section consider
only the binary relations among nodes. In the real world,
however, links are naturally weighted; a weight may represent,
for example, the transportation load between two airports in
an airline network or the number of co-authored papers in a
co-authorship network. We expect that a similarity index
taking link weights into account can give better predictions.
Murata and Moriyasu [12] proposed a simple way to extend a
similarity index for binary networks to a weighted index.
Following this method, the weighted CN, weighted AA and
weighted RA indices (denoted by WCN, WAA and WRA,
respectively) are given by Eqs. (4-6).
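
For reference, a plausible written-out form of Eqs. (4-6) is given below. It assumes the usual extension in which each common neighbor contributes the sum of the weights of its two links, which is consistent with the $\log(1 + s(z))$ term mentioned for Eq. (5); the superscripts are added here only to tell the three indices apart.

```latex
\begin{align}
s_{xy}^{\mathrm{WCN}} &= \sum_{z \in \Gamma(x) \cap \Gamma(y)} \bigl[ w(x,z) + w(z,y) \bigr], \tag{4} \\
s_{xy}^{\mathrm{WAA}} &= \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{w(x,z) + w(z,y)}{\log\bigl(1 + s(z)\bigr)}, \tag{5} \\
s_{xy}^{\mathrm{WRA}} &= \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{w(x,z) + w(z,y)}{s(z)}. \tag{6}
\end{align}
```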
Here, $w(x,y) = w(y,x)$ denotes the weight of the link between
nodes $x$ and $y$, and $s(x) = \sum_{z \in \Gamma(x)} w(x,z)$ is the
strength of node $x$. Note that, since $s(z)$ may be smaller than 1, we
use $\log(1 + s(z))$ in Eq. (5) to avoid negative scores.

To our surprise, when we apply the weighted indices to the
three experimental networks, as shown in Table 1, we find
that, except for WRA in NetScience, the weighted indices
perform clearly worse than their corresponding unweighted
versions. For CN in particular, the precision decreases
sharply once the weights are taken into account. These
unexpected results remind us of the well-known Weak Ties
Theory [7], which states that people usually obtain useful
information or opportunities through acquaintances rather
than close friends; that is, the weak ties in their friendship
network play a significant role. Recently, Onnela et al. [15]
demonstrated that the weak ties mainly maintain the
connectivity in mobile communication networks, and Csermely
found that the weak ties may maintain the stability of
biological systems [4]. In contrast, the role of weak ties in
the link prediction problem has not yet been investigated.

5. ROLE OF WEAK TIES 

In this section, we provide a starting point for investigating
the role of weak ties in link prediction by introducing a free
parameter, $\alpha$, to control the relative contributions of weak
ties to the similarity measure. The parameter-dependent
indices for WCN, WAA and WRA are:
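
A plausible written-out form of Eqs. (7-9) is given below. It assumes that each link weight in Eqs. (4-6) is simply raised to the power $\alpha$, which reduces to Eqs. (4-6) at $\alpha = 1$; the superscripts are added here only to tell the three indices apart.

```latex
\begin{align}
s_{xy}^{\mathrm{WCN}} &= \sum_{z \in \Gamma(x) \cap \Gamma(y)} \bigl[ w(x,z)^{\alpha} + w(z,y)^{\alpha} \bigr], \tag{7} \\
s_{xy}^{\mathrm{WAA}} &= \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{w(x,z)^{\alpha} + w(z,y)^{\alpha}}{\log\bigl(1 + s(z)\bigr)}, \tag{8} \\
s_{xy}^{\mathrm{WRA}} &= \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{w(x,z)^{\alpha} + w(z,y)^{\alpha}}{s(z)}, \tag{9}
\end{align}
```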
where $s(x) = \sum_{z \in \Gamma(x)} w(x,z)^{\alpha}$. When $\alpha = 0$,
$s(x)$ is the degree of node $x$, and the indices degenerate to
the unweighted cases. When $\alpha = 1$, the indices are
equivalent to the simply weighted indices of Eqs. (4-6). The
numerical results are given in Figure 1, Table 1 and Table 2.
In all cases, the optimal values of $\alpha$ are smaller than 1;
that is to say, the weak links play a more important role in
link prediction than their weights indicate. A big surprise is
that the optimal values of $\alpha$ are sometimes negative, which
means that the weak links actually play a more important role
than the strong links. Although it is well known that weak
ties mainly maintain the network connectivity, this result is
still striking to us.

Table 2: Optimal values of the parameter $\alpha$, i.e., those
yielding the highest precisions. For CGScience, as $\alpha$
decreases the precision increases monotonically
and eventually reaches a stable value, 0.782, at
$\alpha = -4.15$.
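
The parameter-dependent indices discussed in this section can be sketched as follows. The code assumes that every weight is raised to the power $\alpha$ both in the numerator and in the node strength, as in the definition of $s(x)$ above; the names `build_weighted`, `strength` and `weighted_scores`, and the dict-based input format, are assumptions made for the example. Weights must be strictly positive so that negative values of $\alpha$ are well defined.

```python
import math

def build_weighted(links):
    """links: dict mapping an undirected pair (x, y) to a positive weight."""
    w = {}
    for (x, y), weight in links.items():
        w.setdefault(x, {})[y] = weight
        w.setdefault(y, {})[x] = weight
    return w

def strength(w, x, alpha):
    """s(x) = sum over neighbors z of w(x, z) ** alpha."""
    return sum(wt ** alpha for wt in w[x].values())

def weighted_scores(w, x, y, alpha=1.0):
    """Return (WCN, WAA, WRA) scores for the pair (x, y) at a given alpha."""
    wcn = waa = wra = 0.0
    for z in set(w[x]) & set(w[y]):
        contrib = w[x][z] ** alpha + w[z][y] ** alpha
        s_z = strength(w, z, alpha)
        wcn += contrib
        waa += contrib / math.log(1.0 + s_z)
        wra += contrib / s_z
    return wcn, waa, wra
```

With `alpha=0` every link contributes 1, so WCN and WRA equal twice CN and RA and yield the same rankings; decreasing `alpha` below 1 progressively shifts the emphasis toward links with small weights.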

6. CONCLUSIONS AND DISCUSSION 

In this paper, we applied three local similarity indices,
Common Neighbors, the Adamic-Adar index and the Resource
Allocation index, to the link prediction problem on three
empirical networks: USAir, NetScience and CGScience. We found
that our previously proposed index, RA [21], performs the best.
Furthermore, to take weights into account, we tested
three weighted variants of CN, AA and RA, denoted by
WCN, WAA and WRA. To our surprise, the weighted indices
perform even worse than their corresponding unweighted
versions. These unexpected results remind
us of the Weak Ties Theory [7], which claims that links with
small weights nevertheless play an important role in social
networks. Extensive experiments show that weak ties play
a significant role in the link prediction problem, and that
emphasizing their contribution can remarkably enhance
the prediction accuracy. Sometimes, in the optimal
cases, the weak ties contribute more than the strong ties.
In other words, the weak links in such networks are not as
weak as their weights suggest.

Although the prediction accuracies of both the unweighted
indices (Eqs. (1-3)) and the simply weighted indices (Eqs.
(4-6)) can be further improved by introducing the parameter
$\alpha$ (Eqs. (7-9)), this paper does not aim at highlighting
these parameter-dependent indices. Instead, we attempt to
uncover the role of weak ties in the link prediction problem.
We hope this paper can provide a starting point for a possible
weak ties theory in information retrieval.